Scenario

Based on 2022 Singapore HDB resale price (real-life) data sets, your team is supposed to construct a multiple regression model (for one particular district) to explain the HDB resale price (ResalePrice) in dollars with the given independent variables.

Marking Criteria

Documentation and Presentation: 10 marks

Methodology: 10 marks

R-codes, computer outputs interpretation and graphical explanations: 15 marks

Recommendations and conclusions: 15 marks

Format:

Written Report: PDF Format: Within 10pages excluding the cover page and Appendix

Appendix: codes with computer outputs

You are required to provide the detailed documentation of how you search your recommended model for inference purpose and justify each step in your data analysis. You are also expected to provide model assumption justification and hypothesis testing evidences (R-codes and computer outputs) with clear explanations that your recommended model is the best model among all the models considered according to BIC criterion. Based on your final recommended model, state clearly your recommendations and conclusions.

Part 1 of Project

Load Data In

## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
Sengkang <- read.csv("data/Sengkang2023P.csv", stringsAsFactors = TRUE)

head(Sengkang, 5)
##      Date   Type Block           Street    Story Area   Model LeaseBegin        LeaseRemain
## 1 2022-01 3 ROOM  331C    ANCHORVALE ST 10 TO 12   67 Model A       2015 92 years 09 months
## 2 2022-01 3 ROOM  209C COMPASSVALE LANE 16 TO 18   67 Model A       2011 88 years 11 months
## 3 2022-01 3 ROOM  211C COMPASSVALE LANE 10 TO 12   68 Model A       2013 90 years 02 months
## 4 2022-01 3 ROOM  467B    FERNVALE LINK 04 TO 06   68 Model A       2016 93 years 08 months
## 5 2022-01 3 ROOM  414B    FERNVALE LINK 07 TO 09   68 Model A       2016  93 years 01 month
##   ResalePrice
## 1      420000
## 2      418000
## 3      410000
## 4      382000
## 5      410000

From the regression above, we know that there are too many categorical variables to consider. We have to condense them accordingly.

length(unique(Sengkang$Block))
## [1] 521
length(unique(Sengkang$Street))
## [1] 29
length(unique(Sengkang$LeaseRemain))
## [1] 226

We can see from the code chunk above that block will have 521, street has 29 and LeaseRemain has 226 categorical variables

Data Wrangling

We will use the lubridate package to adjust the year and calculate lease years used as a numeric rather than a categorical variable. Since years_used and LeaseRemain are perfected correlated, we will drop LeaseRemaind from the dataframe.

site, the HDB blocks are numbered by 100+, 200+, 300+ and 400+ in Rivervale, Compassvale, Anchorvale and Fernvale respectively.

df <- Sengkang %>%
  mutate(Date = lubridate::ym(Date),
         LeaseBegin = lubridate::ym( paste0(LeaseBegin,"-01")),
         years_used = as.numeric((Date - LeaseBegin)/365),
         subzone = ifelse(grepl("^1", Block), "Rivervale",
                          ifelse(grepl("^2", Block), "Compassvale",
                                 ifelse(grepl("^3", Block), "Anchorvale",
                                        ifelse(grepl("4", Block), "Fernvale", "others")))),
         .before = Street) %>%
  mutate(subzone= as.factor(subzone)) %>%
  select(-LeaseRemain, -LeaseBegin)

Using leaseremain as a factor will generate too many binary variables. Convert them into years_used would be easier. Date and LeaseBegin variables must be in date type before substracting between the two. The output would be in (drtn) days and thus we have to set it to numeric set to years.

Run Regression

All variables

We will first calculate the BIC of the regression of all variables.

reg_all <- lm(ResalePrice ~ ., data = df)
BIC(reg_all)
## [1] 52020.52

We have three types of location columns now, Block, Street and subzone. There should be high correlation between the X variables for these 3 variables, we will test the BIC number by dropping each variable out and picking the model with the lowest BIC

Removing Block

L1 <- lm(ResalePrice ~ .-Block, data = df)

Removing Street

L2 <- lm(ResalePrice ~ .-Street, 
         data = df)

Removing subzone

L3 <- lm(ResalePrice ~ .-subzone, 
         data = df)

It would seem removing Block would be the best. Now we will compare between street and subzone ## Removing Block and Street

L4 <- lm(ResalePrice ~ .-Block-Street, 
         data = df)

Removing Block and subzone

L5 <- lm(ResalePrice ~ .-Block-subzone, data = df)

Comparing BIC

BIC_location <- data.frame(lm = c(".-Block",".-Street",".-subzone",".-Block-Street",".-Block-subzone"),
                              BIC = c(BIC(L1),BIC(L2),BIC(L3),BIC(L4),BIC(L5)))
BIC_location
##                lm      BIC
## 1         .-Block 49828.63
## 2        .-Street 52020.52
## 3       .-subzone 52020.52
## 4  .-Block-Street 50903.32
## 5 .-Block-subzone 49875.08

Visualize L1

plot(L1)

plot(df$Date, df$ResalePrice)

plot(df$Type, df$ResalePrice)

plot(df$Block, df$ResalePrice)

plot(df$years_used, df$ResalePrice)

plot(df$subzone, df$ResalePrice)

plot(df$Street, df$ResalePrice)

plot(df$Story, df$ResalePrice)

plot(df$Area, log(df$ResalePrice))

plot(df$Model, df$ResalePrice)

residual_plot <- function(x,regression){
  n <- names(x)
  r <- residuals(regression)
  for (i in 1:length(x)){
    plot(x$n[i], r)
  }
}
residual_plot(df, L1)
r <- residuals(L1)
plot(df$Date, r,
     xlab = "Date", ylab = "Residuals")

plot(df$Type, r,
     xlab = "Type", ylab = "Residuals")

plot(df$Block, r,
     xlab = "Block", ylab = "Residuals")

plot(df$years_used, r,
     xlab = "years_used", ylab = "Residuals")

plot(df$subzone, r,
     xlab = "subzone", ylab = "Residuals")

plot(df$Street, r,
     xlab = "Street", ylab = "Residuals")

plot(df$Story, r,
     xlab = "Story", ylab = "Residuals")

plot(df$Area, r,
     xlab = "Area", ylab = "Residuals")

plot(df$Model, r,
     xlab = "Model", ylab = "Residuals")